# Cloudflare Cache For Crawlers
## Why we cache bots
Search and AI crawlers make thousands of anonymous GET/HEAD hits against tenant storefronts. Every request used to traverse Caddy → Cloudflare → API Gateway → farfalla, forcing us to render full HTML repeatedly. Enabling cache at Cloudflare means those bots now receive a four-hour cached copy, while regular browsers bypass cache because their user agents do not match the crawler allow-list.
## Current rule
- Ruleset: `http_request_cache_settings` phase → rule `243fedab25584185a62473f4b68b16c9`
- Match: `http.host contains "farfalla-entry-point.publica.la"` and the user agent matches one of `Amazonbot|Anchor Browser|Applebot|archive.org_bot|bingbot|Bytespider|CCBot|ChatGPT-User|ClaudeBot|Claude-SearchBot|Claude-User|DuckAssistBot|FacebookBot|Googlebot|Google-CloudVertexBot|GPTBot|meta-externalagent|meta-externalfetcher|MistralAI-User|Novellum|OAI-SearchBot|PerplexityBot|Perplexity-User|PetalBot|ProRataInc|Timpibot`
- Action: `set_cache_settings` with `cache: true`, `origin_cache_control: false`, `origin_error_page_passthru: true`, `edge_ttl.mode: override_origin`, `edge_ttl.default: 14400` seconds (4 h)
- Cache key: origin host/path/query plus a Geo dimension (per-country). The Language dimension is currently disabled, so multilingual HTML should include the locale in the URL itself.
Because Caddy always proxies tenant domains to farfalla-entry-point.publica.la, this filter applies to every storefront hostname even though Cloudflare technically only sees the internal host header.
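If the TTL or other action parameters need to change, the rule can be edited in place through Cloudflare's Rulesets API. A minimal sketch, assuming the `CF_API_TOKEN`/`ZONE_ID` variables from the API snapshot below and a token allowed to edit zone rulesets; `<match expression>` is a placeholder for the full expression above, which must be resent because the endpoint replaces the whole rule:

```bash
# Sketch only: PATCH the existing cache rule in place.
# <match expression> stands in for the full Match expression quoted above.
curl -s -X PATCH \
  "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/rulesets/56902a1ea0f64980a5dc883c2e602096/rules/243fedab25584185a62473f4b68b16c9" \
  -H "Authorization: Bearer $CF_API_TOKEN" \
  -H "Content-Type: application/json" \
  --data '{
    "action": "set_cache_settings",
    "expression": "<match expression>",
    "action_parameters": {
      "cache": true,
      "origin_cache_control": false,
      "origin_error_page_passthru": true,
      "edge_ttl": { "mode": "override_origin", "default": 14400 }
    }
  }'
```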
## Operational notes
- Testing: watch the `CF-Cache-Status` header while hitting a tenant domain with one of the bot user agents (e.g., `curl -A GPTBot https://tenant-domain/library`). Expect `MISS` → `HIT` after the first request.
- Purging: invalidate specific URLs when content changes faster than the four-hour TTL using `curl -X POST .../purge_cache` with a `files` payload (see the purge sketch after this list), or rely on farfalla's existing purge hooks.
- API snapshot: fetch the live JSON anytime:
```bash
export CF_API_TOKEN=...
ZONE_ID="$(curl -s -H "Authorization: Bearer $CF_API_TOKEN" "https://api.cloudflare.com/client/v4/zones?name=publica.la" | jq -r '.result[0].id')"
curl -s -H "Authorization: Bearer $CF_API_TOKEN" "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/rulesets/56902a1ea0f64980a5dc883c2e602096" |
  jq '.result.rules[] | select(.id=="243fedab25584185a62473f4b68b16c9")'
```
- Adjusting TTL: increase `edge_ttl.default` if the cache hit ratio is high and purge automation is reliable; drop it if crawlers complain about stale HTML.
- Future exclusions: if bots can reach any authenticated or dynamic views (admin, checkout, preview, API), extend the rule expression with `not starts_with(http.request.uri.path, "/admin")`, etc., before enabling cache there (see the expression sketch after this list).
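A one-off purge of specific storefront pages looks like this; a sketch assuming the `CF_API_TOKEN`/`ZONE_ID` variables from the API snapshot above, with placeholder URLs:

```bash
# Sketch: purge individual URLs ahead of the 4 h TTL.
# Replace the files entries with the pages that actually changed.
curl -s -X POST "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/purge_cache" \
  -H "Authorization: Bearer $CF_API_TOKEN" \
  -H "Content-Type: application/json" \
  --data '{"files": ["https://tenant-domain/library"]}'
```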
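And a sketch of how the extended match expression could look once exclusions are needed (the paths here are illustrative; the user-agent clause stays unchanged):

```
http.host contains "farfalla-entry-point.publica.la"
and not starts_with(http.request.uri.path, "/admin")
and not starts_with(http.request.uri.path, "/checkout")
and not starts_with(http.request.uri.path, "/api")
```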
## Troubleshooting checklist
- Cache miss for bots → confirm the user agent string exactly matches one of the listed tokens.
- Bots still blocked → ensure WAF/firewall rules are not blocking those UAs on the proxy hostname.
- Wrong language served → ensure the locale lives in the path/query while `user.lang` is not part of the cache key (see the check after this list).
- Logged-in pages cached → confirm the browser user agent (or any extension) is not spoofing one of the crawler identifiers; if it is, the request will match the cache rule regardless of cookies.
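A quick way to confirm the language behaviour, sketched against a placeholder tenant domain: once the cache is warm, both requests below should come back `HIT` because `Accept-Language` is not part of the cache key.

```bash
# Sketch: the cached copy should be served regardless of Accept-Language.
for lang in es-AR en-US; do
  printf 'Accept-Language: %s -> ' "$lang"
  curl -s -o /dev/null -D - -A GPTBot -H "Accept-Language: $lang" \
    https://tenant-domain/library | grep -i '^cf-cache-status'
done
```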
## JS smoke test
Run this Node.js (18+) snippet locally to exercise the rule. It hits every target twice with a bot user agent, so you can observe `MISS` → `HIT`, and once with a normal browser user agent, which should report `DYNAMIC`. Hashing the body lets you confirm `/library` renders different HTML per hostname.
```js
import crypto from 'node:crypto';
const targets = [
{ label: 'La Tercera /library', url: 'https://kiosco.latercera.com/library' },
{ label: 'Bajalibros /library', url: 'https://ar.bajalibros.com/library' },
{ label: 'Bajalibros publication', url: 'https://ar.bajalibros.com/library/publication/horoscopo-chino-2026' },
];
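// Two passes with an allow-listed bot UA should show MISS then HIT;
// the single browser pass should stay DYNAMIC.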
const agents = [
{ label: 'bot', ua: 'Googlebot', repeat: 2 },
{ label: 'browser', ua: 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36', repeat: 1 },
];
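// Requests run sequentially so the second bot pass hits the cache warmed by the first.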
for (const target of targets) {
for (const agent of agents) {
for (let i = 1; i <= agent.repeat; i += 1) {
const res = await fetch(target.url, { headers: { 'User-Agent': agent.ua } });
const body = await res.text();
const hash = crypto.createHash('sha1').update(body).digest('hex');
console.log(
`${agent.label.toUpperCase()} pass ${i} | ${target.label} | status=${res.status} | cf-cache-status=${res.headers.get('cf-cache-status')} | digest=${hash.slice(0,10)}`,
);
}
}
}
```
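To run it, save the snippet as, say, `crawler-cache-smoke.mjs` (the `.mjs` extension makes Node treat it as an ES module, which the top-level `await` requires) and execute `node crawler-cache-smoke.mjs`.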